Understanding Zip Files and Lambda Functions
That's not possible! Learning new and fun things about the Zip file data structure. Reading time - 2 minutes
July 7, 2020
That’s not possible!
How many times have you said this to yourself, while working on a bug?
I found myself saying it recently. Here at serverless we’ve been hard at work on a killer developer experience called components, and part of my job has been to design and build the onboarding experience.
Components are meant to be small, reusable pieces of infrastructure-as-code (think libraries or node modules, but for cloud infrastructure). People can publish components to a registry and share them with other developers. To help people get packages from the registry we sought to build a simple, one-command initialization system for the framework that would get developers up and running in the most frictionless way possible, like teflon, but for cloud development.
The init
command does a lot of things, but for the sake of brevity, let’s say it fetched a zip archive from the component registry, inflated/extracted it, and pre-configured attributes in the serverless.yml
file for the developer.
The publish
command was mostly the process in reverse. We’d gather up the files in the workspace, generate a new serverless.yml
file based on the existing serverless.yml
file in the workspace, compress them, and push a component to the registry.
The impossible bug
As I began testing the init
command end-to-end, I saw that the serverless.yml
file that was unzipped from the registry seemed to include attributes that we didn’t store in the template.
However - when I manually unzipped the file on my macbook, the serverless.yml
files It appeared to be the newly generated file, exactly as we’d expect the publish
command to do.
I stepped through the code once more and scratched my head - the code says that the original serverless.yml
file lived in the zip file - and that the generated serverless.yml
file was missing!
How could this be possible? How could one copy of an unzipped archive contain different files than ANOTHER copy of the very same archive?!
Proving my assumptions wrong
Eventually I tried using unzip on the file and was greeted with the strangest message:
There were two serverless.yml
files in the same directory inside of the zip file.
Although some filesystems over the years have supported multiple files with the same name in the same directory, on most systems the filename must be unique to the directory the file is in. This is true for HFS, NTFS (unless you really break it), and ext4.
However in a zip archive, files are identified by a metadata header, which includes the filename. This means that it’s totally possible to put two files with the same name in the same zip archive.
Internal structure of a zip file, image by wikipedia
I inadvertently discovered that adm-zip
would silently overwrite one file with the other when extracting into a directory. As it turns out, MacOS does the same thing - however both utilities seemed to pick different files. unzip
will ask you what to do with the duplicate file, which leads me to suspect that this is a known edge case with zip files, and that the decision regarding what to do in this case has been largely left up to the author of the library.
Fixing the bug and closing thoughts
When a user would run the publish
command, internally the framework would build up an array of files to include in the zipped package. Additionally we’d add the serverless.yml
file into the array, modifying it so it could be used as a package in the registry. This inadvertently led to two serverless.yml
files being happily written to the registry zip archive. I simply had to modify the publish
tree-walking algorithm to skip any serverless.yml
files that the author may have inadvertently left in the package root.
It was fun to learn that an assumption I’ve held since my earliest interactions with computers is completely baseless - it’s totally possible to have more than one file with the same name in the same directory (in a zip archive, anyway).